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Abstract 

Two sequentially Markov coalescent models (SMC and SMC’) are available as tractable approximations 
to the ancestral recombination graph (ARC). We present a Markov process describing coalescence at two 
fixed points along a pair of sequences evolving under the SMC’. Using our Markov process, we derive a 
number of new quantities related to the pairwise SMC’, thereby analytically quantifying for the first time 
the similarity between the SMC’ and ARC. We use our process to show that the joint distribution of pairwise 
coalescence times at recombination sites under the SMC’ is the same as it is marginally under the ARC, 
which demonstrates that the SMC’ is, in a particular well-defined, intuitive sense, the most appropriate first- 
order sequentially Markov approximation to the ARC. Finally, we use these results to show that population 
size estimates under the pairwise SMC are asymptotically biased, while under the pairwise SMC’ they are 
approximately asymptotically unbiased. 

Keywords: Sequentially Markov coalescent. Ancestral recombination graph, consistency, ergodicity, 
Markov approximation 


1. Introduction 


Of the many models of genetic variation in the field of population genetics, few have as much relevance 
in the era of genomics as the ancestral recombination graph (ARC). The ancestral recombination graph 
models patterns of ancestry and genetic variation within sequences experiencing recombination under neutral 


conditions (Hudson, 1991 Griffiths and Marjoram 1997|. Under the formulation of Griffiths and 


Marjoram (1997), lineages recombine apart and coalesce together back in time to produce a graph structure 


describing the ancestral genealogy at each point along a continuous chromosome. While only a few simple 
rules govern the process, many aspects of the model are analytically intractable. 


WiUF and Hein (1999) provided a formulation of the ARG that proceeds across the chromosome (rather 


than back in time), producing the genealogy at each point sequentially. As with the back-in-time formulation 
of the ARG, at each point along the chromosome there is a local genealogy describing the ancestry of 
the sample at that point, and changes in the genealogy occur at recombination sites. In this sequential 
formulation of the ARG, a new lineage is produced wherever an ancestral recombination event is encountered 
along the chromosome. To produce a new genealogy at the recombination site, the new lineage is coalesced 
to the ARG representing the ancestry of all previous points along the chromosome. This dependence on all 
previous points makes the process non-Markovian along the chromosome and (together with a large state 
space) makes calculations often intractable. 

Approximations to the ARG have been suggested with the goal of modeling coalescence with recombi¬ 
nation in a way that is analytically tractable. McVean and Gardin (2005) introduced the sequentially 


Markov coalescent (SMC). The original formulation of the SMC was sequential, generating genealogies along 
the chromosome such that each new genealogy depends only on the previous genealogy. Like the ARG, the 
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SMC has both a back-in-time formulation and a sequential formulation. The back-in-time formulation of 
the SMC is equivalent to that of the ARC except that coalescence is allowed only between lineages contain¬ 
ing overlapping ancestral material. As a consequence, in the sequential formulation of the pairwise (n = 2 
chromosomes) SMC, each recombination event produces a new pairwise coalescence time. 


Marjoram and Wall (2006) introduced a slight modification to the SMC, termed the SMC’, which 


retains the Markov behavior along the chromosome but models additional coalescence events that make it 
a closer approximation to the ARC. Specifically, in the back-in-time formulation of the SMC’, coalescence 
is allowed between lineages containing either overlapping or adjacent ancestral material. In the sequen¬ 
tial formulation of the pairwise SMC’, this means that not every recombination event necessarily produces 
a new pairwise coalescence time, since two lineages created by a recombination event can coalesce back 
together. Figure shows the transitions that are permitted under the back-in-time and sequential formu¬ 
lations of the pairwise ARC, SMC, and SMC’. The sequentially Markov coalescent models have been used 
in many recently introduced population-genetic, model-based inference procedures, including the pairwise 


SMC (PSMC) model ( 

Li and Durbin, 20111, multiple SMC (MSMC) model ( 

Schiffels and Durbin 

2014), diCal (Sheehan et al. 

2013 

), coalHMM ( 

Hobolth et al.\ 2007| Dutheil et al. 

2009 

), ARG Weaver 


(Rasmussen et al.\ 

201-^ 

; 1, and in a study of past human demography based on tracts of identity by state 

(Harris and Nielsen 

>013). 
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Figure 1: Transitions permitted under the pairwise ARG, SMC’, and SMC models. Under “Sequential Transitions,” a transition 
occurs left to right across the chromosome at the rightmost recombination event (marked with red line). The ith coalescence 
time is labeled as U. Under “Back-in-time transitions,” the arrow indicates a coalescence event that occurs between two aligned 
ancestral chromosomes, each carrying a combination of ancestral (solid black lines) and non-ancestral material (dashed gray 
lines). Ancestral material is defined as a portion of a chromosome that is ancestral to the sample. 


The SMC’ was shown by simulation to produce measurements of linkage disequilibrium more similar 
to the ARG than those produced by the SMC (Marjoram and Wall 2006), and was hence used as the 


preferred model by some recent studies (Harris and Nielsen, 2013 Schiffels and Durbin, 2014 Zheng 
et al. 2014). Additionally, a number of recent studies have explored the theoretical properties of the SMC’ 
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(Eriksson et al. 

2009 

Harris and Nielsen 

2013 

Carmi et al. 2014 

Schiffels and Durbin 

2014 

Zheng et al. 

2014 

). However, few direct comparisons between the SMC’ and the ARC have been made, and 


there remain a number of open questions. Here, we show how the joint distribution of pairwise coalescence 
times at two fixed points along a chromosome evolving under the SMC’ can be described by a continuous¬ 
time Markov chain. Through analysis of this Markov chain, we calculate many statistical properties of the 
pairwise SMC’ and compare them to those of the ARC and SMC. Specifically, for each model of coalescence 
with recombination, we compare the following: the joint density /Ti,T 2 (H,i 2 ) (Section |2.2[ ), the conditional 
density /t 2 |Ti(^ 2 |H) (Section |2.3[ ), and the covariance between Ti and T 2 , which we show to be equal to the 
probability that Ti and T 2 are the same (Section 2.41. These quantities are readily related to measures of 
linkage disequilibrium in real sequence data. 

Using our two-locus Markov process for the two-locus, pairwise SMC’, we also show that the joint dis¬ 
tribution of coalescence times immediately to the left and right of a recombination event is the same under 
the SMC’ and ARC. This allows us to calculate the asymptotic bias of the pairwise SMC- and SMC’-based 
population-size estimators, which we confirm by simulation. We show that the SMC’ estimator is approxi¬ 
mately asymptotically unbiased. 


2. Results 


2.1. Two-locus Markov chain model for the SMC and SMC’ 

Here, we present back-in-time Markov processes for the two-locus SMC and SMC’. Previous work has 


developed analogous two-locus, back-in-time Markov processes for the ARC. Kaplan and Hudson (1985) 


first described how the process of generating coalescence times at two linked loci modeled by the ARC 
can be represented as a continuous-time Markov chain, with coalescence and recombination events causing 


transitions between states. Simonsen and Churchill (1997) explored this process further for the case where 
the sample size is n = 2 and derived for the ARC many of the quantities we compare against the SMC’ in 
this paper. Subsequent work has extended this approach to study two-locus coalescence distributions in the 


presence of population structure (Eriksson and Mehlig 2004) and recurrent bottlenecks (Schaper et al. 


2012), and to study species-tree concordance at linked loci (Slatkin and Pollack 2006) and coalescence 


histories at one locus conditional on the history at a nearby locus (Hobolth and Jensen 


20141. 


We begin by presenting the simpler SMC model, which will provide context for the more-complex SMC’ 
model. If time is scaled such that the rate of coalescence is 1 and the total rate of recombination between the 
two linked loci is p/2, then the two-locus ancestral process under the SMC is the process depicted in Figure]^ 
The process starts in state Rq with two lineages, each containing linked copies of the two loci. From Rq, 
the process transitions with rate p to state Rx, in which one of the two chromosomes has experienced a 
recombination event, or with rate 1 to state Cb, an absorbing state in which both loci have coalesced. Under 
the SMC, lineages can only coalesce if they contain overlapping ancestral material, so after entering Rx, the 
process cannot return to the fully linked state Rq, and each locus coalesces independently with rate 1 from 
that time onward. Thus, under the SMC, the joint distribution of coalescence times at two loci is that of 


(Ti,T 2 ) - (Ao + RXl, Xq + RXr), 


( 1 ) 


where Xq ^ Exp(l -|- p) is the amount of time to leave Rq, R ^ Bernoulli( indicates whether the first 
event is a recombination event, and Xr ~ Xr ^ Exp(l) are the exponential waiting times until coalescence 
after the first recombination event. All of these random variables are independent in the SMC model, so it 
is straightforward to calculate many of the quantities we compare in this paper using this representation. 

The defining rule of the SMC’ model of coalescence with recombination is that only ancestral lineages 
containing overlapping or contiguous ancestral material can coalesce (Marjoram and Wall 2006). The 


back-in-time process of coalescence at two fixed loci under this model is the continuous-time Markov chain 
shown in Figure Under the SMC’, it is necessary to model the number of recombination events that 
have occurred between the two loci at each point in time. To see that this is the case, consider the state 
R 2 in Figure In this state, two recombination events have occurred between the focal loci, and neither 
focal locus has coalesced. Because lineages can only coalesce to lineages containing overlapping or adjacent 
ancestral material, two particular coalescence events would need to occur before the process returns to state 
Rq, regardless of the placement of the recombination events on the two chromosomes. 
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The SMC’ two-locus Markov process also features an additional state I, which is entered when some 
portion of the chromosome between the focal loci coalesces before either of the focal loci. Upon entering I 
it becomes impossible for the process to re-enter the initial, fully-linked state (Ro), so the remaining times 
until coalescence at the focal loci become independent exponential random variables with mean 1. If Ri is 
the state in which neither focal locus has coalesced and i recombination events have occurred between the 
focal loci, the transition rate into I is i — 1. This is due to the fact that each recombination event after the 
first produces an additional pair of lineages that can coalesce to take the process to I. For each state Ri, 
i> 1, the number of lineages that can coalesce to take the process to Ri—i is i, and the rate of transitioning 
to Ri-\-i through recombination is p. Transitions to Cl and Cr occur at rate 1 whenever the process is 
in state Ri, i > 1. Following Eriksson and Mehlig (2004), we disregard any information about linkage 
between the two loci after one locus has coalesced, since the rate of coalescence at the uncoalesced locus is 
1 regardless of the state of linkage with the coalesced locus. 



Figure 2: Schematic of the SMC back-in-time Markov process for two loci. The process starts in state Rq, and transitions to 
other states occur with the rates indicated by arrows between states. 


For comparison, an analogous two-locus continuous-time Markov chain for the ARG is presented in 
An equivalent process was studied by Simonsen and Churchill (1997) and others. In this 
state Ri is reached when the first event is a recombination event, and state R 2 


Figure SI 
model 


is reached only 

after a subsequent recombination event occurs on the ancestral lineage that did not experience the first 
recombination event, making all ancestral copies of the two loci unlinked. 


2.2. Joint probability density functions 

Considering the SMC’ model above, let Ro{t) represent the probability that the two-locus ancestral 
coalescent process is in state Rq at time t, and let R'^{t) represent the probability that the process is in any 
state Ri, i > 1 , 01 state I, at time t. The joint density of coalescence times at the two focal loci is then 


( i?o(tl) ti — t 2 

fT„T,{tuh) = < ti < t2 

ti>t2. 


( 2 ) 


since i?o(t) is the rate of entering state Cr at time t, and R^{t) is the rate of entering either Cl or Cr 
at time t. The joint density for the ARG and SMC is analogously defined, with R'^{t) representing i?i and 
R 2 under the ARG and Ri under the SMG. For the ARG and the SMG, the number of states is finite and 
Ro{t) and R'^{t) can be solved using matrix exponentiation. For the SMC’, there is an infinite number of 
states, representing the possibility of an infinite number of recombination events occurring between the two 
focal loci. In the Appendix, we derive closed-form expressions for Ro(t), R'^{t), and /ti.Ts(U,^ 2 )- The main 
idea in these derivations is to recognize the structure of the SMC’ in Figure 3 as a birth-death process with 
killing. In this formulation the states are Ri ,{i > 0}, a birth corresponds to a recombination event (and the 
birth rate is constant), a death corresponds to a coalescence event (and the death rate is linear), and killing 
corresponds to leaving the Ri states. 
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Figure 3: Schematic of the SMC’ back-in-time Markov process for two loci. Dashed arrows show transition rates that apply for 
all Ri. State I is the state in which some portion of the chromosome between the two focal loci has coalesced but neither focal 
locus has coalesced. The red lines in states fi2 and Rs show the coalescence events that take the process to state I. 


Figure [^compares the joint coalescence time distributions under the SMC and SMC’, displaying the dif¬ 
ferences of these joint distributions with the joint distribution of the ARC. The SMC’ provides a much better 
fit to the ARC for the range of recombination rates compared. Both the SMC and the SMC’ underestimate 
the density of outcomes where Ti = T 2 , but this underestimation is substantially less under the SMC’. 

To summarize the difference between the joint distributions more succinctly, we calculated the total vari¬ 
ation distance between the SMC and ARC and between the SMC’ and ARC across a range of recombination 
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Figure 4: Comparison of the difference in the joint density of coalescence times /ti ,T 2 (^i»^ 2 ) between the SMC and ARG (top 
row) and SMC’ and ARG (bottom row). Comparisons are made for three different recombination rates (p = 0.1,1.0, 5.0). 


rates. The total variation distance between the SMC and the ARG is defined as 

-| poo poo 

TV{SMG,ARG) = - / (3) 

^ Jo Jo 

where and are the joint densities /ti,T 2 (^i,^ 2 ) defined under the SMC and ARG, 

respectively. The total variation distance between the SMC’ and ARG is similarly defined. Figure shows 
the total variation distance from the ARG for the SMC and SMC’ over a range of recombination rates. Total 
variation distances were calculated numerically. For both the SMC and SMC’, the total variation distance 
was maximized at some intermediate recombination rate, approximately p = 1.1 for the SMC and p = 3.2 
for the SMC’. 


2.3. Conditional distribution of coalescence times 

In this section we consider the distribution of coalescence times at one locus given the coalescence time 
at the other. The conditional density of T 2 given Ti, /T 2 |Ti(i 2 |ti), can be calculated by dividing the joint 
density by the marginal distribution of coalescence times at the left locus: 


/T2|Ti(i2|tl) - 


fTi,T2(tl,t2) 


( 4 ) 


Hobolth and Jensen (2014) introduced a framework for modeling the distribution of T 2 given Ti using 


a time-inhomogeneous continuous-time Markov chain. (Note that the model called SMC’ in Hobolth and 
Jensen (2014) is an SMC’-like model of two loci that is not based on the continuous-chromosome SMC’. 
It is different from the SMC’ model we consider here.) This framework can be extended to the SMC’, 
producing the continuous-time Markov chain shown in Figure [S^ Figure [^compares the conditional density 
/t2 ITiihlti) of coalescence times ^2 at the right locus conditioned upon the coalescence times ti at the left 
locus for different values of ti and p. 

We note that recently it was proposed that the mutation rate could be estimated by simulation-based 
calibration of the increase in mean heterozygosity when moving away from a site of known, low heterozygosity 
(Lipson et al. 2015). Our expressions for the conditional distribution of coalescence times could provide 


theoretical expectations for such a statistic. 
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Figure 5: Total variation distance between the SMC and ARC (solid line) and the SMC’ and ARC (dashed line) as a function 
of recombination rate. Total variation distances were calculated numerically. 
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Figure 6: Comparison of densities of coalescence times f 2 at the right locus conditioned upon coalescence times ti at the 
left locus. Conditional densities /t 2 |Ti (^ 2 |fl) are shown for the ARG, SMC, and SMC’ models for three different rates of 
recombination between the two loci (p = 0.1,1.0, 5.0) and three different conditioned-upon coalescence times ti at the left locus 
(tl = 0.1,1.0,4.0). The area under each curve is P{T 2 yf ti|Ti = fi); the conditional probabilities P(T 2 = ti|Ti = ti) are not 
shown. 


2.4- Probability of coalescence times being equal to covariance between coalescence times 

In the two-locus, back-in-time Markov processes for the SMC, SMC’, and ARG, Ti and T 2 are equal 
when the state Cb is entered through Rq rather than Cl ov Cr. For the ARG, [SiMONSEN and Churchill] 
(1997) showed that the probability that Ti is equal to T 2 is 


-PARG('ri = 72) = 


p-bl8 


p2 -b 13p -b 18 


Under the SMC ( [McVean and Cardin[ |2005l , 

Psmc{Ti = 72 ) = 


1 -bp 


(5) 


( 6 ) 


Eriksson et al. (2009) used the sequential formulation of the SMC’ to show that 
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where A(t) = | (l — e + 2i) is the exponential rate of encountering a change in coalescence time when the 
local coalescence time is t and r(a, b) = x°‘~^e~^dx is the incomplete gamma function. 


For the ARG and SMC, the covariance Cov[Ti,T 2 ] is equal to P{Ti = T 2 ). Eriksson et al. (2009) 


showed by simulation that this is also true of the SMC’. Here we present a short proof that this is the case 
for any two-locus model of coalescence where the marginal distribution of coalescence times is exponential 
with rate 1 . 

The expectation E[TiT 2 ] can be derived using the fact that (a — 6 )^ = + b'^ — 2ab: 


2E[Tir2] = E[T^] + E[T|] - E[(Ti - T2 f] 

= 2 + 2 - E [(Ti - T2f\Ti ^ T2]P(Ti ^ T 2 ) (8) 

= 4-2P{Ti^T2). 


The final equality in ([^ follows from the fact that |Ti — T 2 I has an Exponential distribution with rate 1 
when Ti ^ T 2 . Therefore E[TiT 2 ] = 2 — P{Ti 7 ^ T 2 ) and 


Cov[ri,T 2 ] = E[riT2] - E[ri] e[T2] 

= E[riT 2 ] - 1 (9) 

= P{T,=T2). 


This result holds in other situations with exponential coalescence times, for example in the context of the 
population-divergence model considered by Eriksson et al. (2009) (in which case the m arginal distribution 
is exponential plus a constant) and for the various covariances used by McVean (2002) to calculate cr^, the 
approximation to the linkage disequilibrium measure r^. 

It is interesting to consider Cov[Ti,T 2 ] = P{Ti = T 2 ) when p is small. For the ARG, consideration 
of ([^ shows that Cov[ri,T 2 ] = Parg(7"i = ^ 2 ) = 1 — 2p/3 -I- 0{p^). Likewise, for the SMC, ([^ shows that 
Cov[ri, T 2 ] = Psmc(Ti = T 2 ) = 1 — p + O(p^). For the SMC’, the integral representation of Psmc’{Ti = T 2 ) 
in 0 allows for the calculation of this quantity as a first-order expansion in p: 


poo 

Cov[ri,T 2 ] = Psmc’(Ti = T 2 ) = / 

Jo 


P J e *X{t)dt + O(p^) = 1 - ^ -p 0{p^). 


( 10 ) 


Thus, Cov[Ti,r 2 ] (or P{Ti = T 2 )) is the same up to order p^ under the ARG and SMC’. 

2.5. Coalescence times at recombination sites 

In this section, we show that the joint distribution of coalescence times on either side of a recombination 
event is the same under the SMC’ and marginally under the ARG, and we derive this distribution. Consider 
the continuous-time Markov chains representing the two-locus SMC’ and ARG models (Figures and SI 
respectively) in the limit of p —>■ 0 and conditioning on the first event being a recombination event. These 
processes represent the joint distribution of coalescence times on either side of a recombination event under 
the ARG and SMC’. In both of these processes, the waiting time until the first event, conditional on that 
event being a recombination event, has an exponential distribution with rate 1 -I- p, which converges to 1 
as p —>■ 0. After that first recombination event, the rate of all additional recombination events converges to 
















zero in the p —>■ 0 limit, so all of the remaining events must be coalescence events, each of which occurs with 
rate 1. Under the ARG and the SMC’, the coalescence events that are possible from state -Ri are the same. 
Thus, the ioint distribution of coalescence times at recombination sites is the same under the SMC’ and the 
ARG. 

Figure]^ shows the two-locus continuous-time Markov chain representing this conditional process. This 
Markov chain starts in a special initial state Rg, out of which the first event is always a recombination event, 
which happens with rate 1, as described above. In previous sections, we used Ti and T 2 to represent the 
coalescence times at two loci some fixed distance apart. To avoid confusion, in this section we use S and T 
to represent the coalescence times on the left and right sides of a recombination event, respectively. 

A B 




Figure 7: Two-locus continuous-time Markov chains representing the ARG and SMC’ models in the p —> 0 limit, conditional 
on the first event being a recombination event. These processes represent the joint distribution of coalescence times on either 
side of a recombination site under the ARG and SMC’. The state Rq is a special starting state out of which the first event 
is always a recombination event. Panel A shows the process unconditional on whether S = T, and Panel B shows the process 
conditional on S yf T. The model representing the joint distribution of coalescence times at recombination sites under the SMC 
is equivalent to the model in Panel B with the transition rates from Ri to Cl and Cr equal to 1 instead of 3/2. 


Recombination events are visible in sequence data only if they change the local coalescence time. Thus, 
it is of special interest to condition on S' 7 ^: T in the above model in order to derive the joint distribution of 
coalescence times on either side of a change in coalescence times under the ARG and SMC’. Conditioning on 
S ^ T, the transition out of Ri must be into either Cl or Cr. These transitions occur with conditional rate 
3/2, since the total rate of leaving Ri is three in the unconditional model, and two of the ways of leaving 
Ri result in the coalescence times being different. 

The continuous-time Markov chain representing coalescence times on either side of a change in coales¬ 
cence times (i.e., at recombination sites where S ^ T) is shown in Figure]^. Under this model, the joint 
distribution of S and T is that of 

{S,T)^ (A1+A2 + RA3, Xi+A2 + ( 1 -R)X 3 ), (11) 

where Xi ^ Exp(l), X 2 Exp(3), R ^ Bernoulli(1/2), X 3 ^ Exp(l), and the random variables are 
independently distributed. 

Under the SMC, the continuous-time Markov chain representing the joint distribution of coalescence 
times at recombination sites is equivalent to the model in Eigure [31 with the transition rates from Ri to 
Cl and Cr equal to 1 instead of 3/2. Under this model for the SMC, the joint distribution of coalescence 
times on either side of a recombination event is that of 

(5,T)~(Ai+A2,Ai+A3), (12) 

where Ai, A 2 , and A 3 are all mutually independent exponential random variables with rate 1 . 

In the Supporting Information, we use these Markov processes to derive the joint, marginal, and con- 
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ditional distributions of coalescence times at recombination sites under the ARG, SMC’, and SMC. These 


calculations confirm previous derivations of Carmi et al. (2014) for the SMC’ and Li and Durbin (2011) 
for the SMC. 


2.5.1. SMC’ as the canonical first-order Markov approximation to ARG 

Under the sequential formulation of the continuous-chromosome ARG, SMC, and SMC’ models, the 
infinitesimal probability of a recombination event occurring in the interval (x, x -\- dx) given the coalescence 
time s at cc is sdx. This fact, together with the fact that the joint distribution of coalescence times at 
recombination sites is the same under the ARG and SMC’ (whether or not the coalescence time changes), 
implies that the conditional distribution of coalescence times at point x dx given the coalescence time at 
point X is the same under the SMC’ and ARG. 

This fact demonstrates that the pairwise SMC’ is the canonical hrst-order Markov approximation for the 
pairwise ARG. Given an infinite-order Markov chain {Xi,z = 0,l,2,...}, where the distribution of each Xj 
depends on all previous Xi, i < j, the canonical fc-order Markov approximation to {Xi} is the Markov chain 
{xl^^} satisfying 


= X„_i, . . . = CCn-l, . . . 


(13) 


That is, the transition probabilities under the fc-order canonical Markov approximation are equal to the 
transition probabilities conditional on the previous k states under the infinite-order chain. See Schwarz 


(1976), Fernandez and Calves (2002), and Gallo et al. (2013) for examples of mathematical studies of 


canonical Markov approximations of infinite-order Markov chains. 

Here we informally extend the terminology of canonical Markov approximations to continuous processes. 
The SMC’ is the canonical first-order Markov approximation to the ARG because the distribution of coales¬ 
cence times at x dx conditional on the coalescence time at x is the same under the ARG (an inhnite-order, 
sequentially non-Markovian continuous process) and the SMC’ (a first-order sequentially Markov continuous 
process). In this sense, the SMC’ is the most natural first-order sequentially Markov approximation to the 
ARG. 


2.6. Asymptotic bias of the population-size estimators under SMC and SMC’ 

Given the joint density of pairwise coalescence times at recombination sites under the ARG, it is possible 
to determine the asymptotic bias of maximum-likelihood population size estimators derived from the pairwise 
SMC and SMC’ likelihood functions. These likelihood functions give the probability of observing a sequence 
of pairwise coalescence times and corresponding segment lengths across a chromosome under the SMC and 
SMC’ models. Related likelihood functions (allowing for variable historical population size) are implicitly 
maximized in the PSMC and MSMC inference procedures ( Li and Durbin[ 2011[ Schiffels and Durbin 
2014, respectively). These inference procedures are hidden Markov model (HMM) methods in which the 


local coalescence times (or genealogies) and segment lengths are hidden states inferred from sequence data. 
Here, we consider the estimators that would be obtained if the hidden states in these models were actually 


observable (see also Kim et al\ 2015). We are motivated by the fact that any biases of the estimators we 
investigate are likely to be inherent in the full HMM-based inference procedures, since these biases would be 
present even with perfect knowledge of an infinite number of coalescence times. Furthermore, by analyzing 
estimators derived from the hidden coalescence states, we isolate the bias that is due to choice of coalescent 
algorithm (SMC vs. SMC’) from the bias due to the mutation model or discretization of the continuous 
hidden states in a full HMM approach to inference. 

To investigate the asymptotic properties of these estimators, we assume that data are generated under the 
ARG, such that at a fixed point the distribution of pairwise coalescence times is exponential with rate equal 
to 1 and an ancestral segment of length I recombines back in time at rate pi 12. Segment lengths are measured 
in units of the true scaled recombination parameter p. Data generated under this model can be represented 
as a sequence of pairwise coalescence times and corresponding segment lengths: {{U, U) '. 1 < i < fc}. 

We are interested in estimating a single relative population size p (dehned relative to the true population 
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size, N). If the data are modeled by the SMC or SMC’, the likelihood of a particular value of rj is 

^ k k 

^ i=2 i=l 

where (7(t|s) is the transition function and X{t]r]) is the rate of encountering the end of a segment given 
t, with both quantities pertaining to the sequentially Markov coalescent model being used to calculate the 
likelihood. 

In the Appendix, we show that if the SMC is used, the maximum-likelihood estimate of rj converges to 
approximately 0.95 as the chromosome gets infinitely long. If the SMC’ is used, the estimate is approximately 
unbiased in the same limit. If the data are reduced to just the segment ages, the likelihood equation is 

1 ‘ ^ 

LivlUUJi)}) = -e~^ Y]_q{ti\U-i;r]). (15) 

Using this reduced likelihood, the asymptotic maximum-likelihood estimate is asymptotically unbiased under 
the SMC’. Under the SMC, the reduced likelihood and the full likelihood produce the same maximum- 
likelihood estimate (see Appendix). 

We confirm the asymptotic bias of the SMC estimator and the apparent lack of asymptotic bias of 
the SMC’ estimators by simulation. Figure shows 100 simulated estimates calculated using the SMC, 
SMC’, and reduced SMC’ likelihood functions. Each estimate was calculated using 100 independent pairs of 
chromosomes simulated under the ARC, with each chromosome of total length 4A^r = 1000, where N is the 
diploid size and r is the per-generation probability of recombination. Likelihood functions were multiplied 
across independent pairs of chromosomes, and the same set of simulations was used to produce the estimates 
for all three likelihood functions. 



Likelihood function / estimator 

Figure 8 : Maximum-likelihood estimates of relative population size with three different Markov chain likelihood functions. 
For each simulation, the segment lengths and coalescence times were taken from 100 independent pairs of chromosomes, with 
each chromosome being of length p = ANr = 1000 simulated under the ARG. A maximum-likelihood estimate was calculated 
using the SMC, SMC’, and times-only SMC’ likelihood functions (equations | |25| |, ( |26| l, and respectively). The true scaled 
population size is 77 = 1, shown with the dashed blue line. The predicted asymptotic bias of the SMC likelihood function 
(fj ^ 0.95) is shown with a solid blue line. The sample mean of the estimates calculated with each likelihood function is shown 
with a solid red line. A total of 100 simulated datasets were analyzed. 
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3. Discussion 


We have presented a continuous-time Markov chain that describes the pairwise coalescence times at two 
fixed loci evolving under the SMC’ model of coalescence with recombination. We analyzed this Markov 
chain to derive the joint distribution of coalescence times at the two loci and the conditional distribution 
of coalescence times at one locus given the coalescence time at the other. We compared these distributions 
to those of the ARC and SMC models and found that the difference between the ARC and the SMC’ was 
much less than the difference between the ARC and the SMC. 

We showed that the conditional distribution of coalescence times at point x + dx given the coalescence 
time at x is the same under the ARC and SMC’. This implies that the SMC’ is the canonical first-order 
approximation to the pairwise ARC. However, this correspondence is true only of the continuous-chromosome 
models. If instead the ARC is a model of the genealogies at a sequence of discrete loci, then the first-order 
canonical Markov approximation is the Markov approximation obtained by modeling a conditional ARC 
between every successive pair of loci. This model was studied by Hobolth and Jensen (2014|, who referred 
to the model as a “natural” Markov approximation to the ARC. Conceptually similar sequentially Markov 
coalescent models have been used in the so-called “coalescent hidden Markov model” family of methods 


Hobolth et al. 

2007 

Dutheil et al. 

to 

o 

o 

Mailund et al. 2011 

I- 


Chen et al. 

O 

O 

presented a method of simulating data under higher-order sequentially Markov ap 


proximations to the ARC, where the ARC of some number of preceding loci is retained in the process of 
generating the marginal genealogy at a given locus. They showed by simulation that higher-order approx¬ 
imations generate times until most recent common ancestry that are more consistent with the ARC than 
do lower-order approximations, but little theoretical work on these higher-order Markov approximations has 
been done. 

The two-locus Markov chains analyzed in this paper assume a single well-mixed population, but natu¬ 
ral populations often have more complex demographic histories, featuring, for example, variable historical 
population sizes, migration between subpopulations, and/or past divergence from other populations. The 
theoretical properties of the sequential, across-the-chromosome formulations of the pairwise SMC and SMC’ 


with variable population sizes have been studied previously (Li and Durbin 2011 Schiffels and Durbin 


20141. Eriksson et al. (20091 used simulation to study two-locus properties of the SMC’ with population 


bottlenecks, migration between subpopulations, and divergence between populations. They found that the 
SMC’ generally performs well in these contexts. The two-locus Markov chains we study here could be ex¬ 


tended to include these features (as was done for the ARC by Lessard and Wakeley (20031 and Eriksson 


and Mehlig (2004)), which would provide a framework for calculating exact quantities for the two-locus 


SMC and SMC’ in the context of structured populations. We leave this for future work. 

We calculated the asymptotic bias of a population size estimator under the pairwise SMC to be approxi¬ 
mately 95% of the true population size. This is not a very large bias, but given the continued use of the SMC 


model in population-genomic inference methods (Palamara et al. 2012 Sheehan et al.. 2013 Rasmussen 


et al. 2014), there is an apparent need to re-examine the consequences of using the simpler SMC model 


instead of the slightly more complicated SMC’ model. For example, it will be important to consider whether 
including the possibility of varying population sizes will increase or decrease asymptotic bias. In this context, 
using the SMC as a basis for a likelihood function may also bias the estimates of the magnitude and timing 
of population size changes, since the longer segments produced by the ARC will seem younger when they 
are modeled under the SMC. 

Depending on the particular application, it may sometimes be mathematically difficult to employ the 
SMC’ instead of the SMC. Nevertheless, the SMC’ is the model underlying two recently introduced population- 
genetic inference methods: the multiple SMC (MSMC) method of Schiffels and Durbin ( 2014[ ) (which 
simplifies to a PSMC’ inference procedure when the number of haplotypes is two) and a procedure based 
on the distribution of distances between heterozygous bases, introduced by Harris and Nielsen (2013). In 
each case it was acknowledged that the SMC’ provided more accurate results than the SMC. 

From the arguments that led to the development of the continuous-time Markov chains representing the 
joint distribution of coalescence times at recombination sites (Figure]^, it seems that the joint distribution 
of coalescence times on either side of a recombination event will be the same under a variety of demographic 
scenarios. If one were to allow variable population historical size, the waiting time until the conditioned- 
upon recombination event would still be the same under the SMC’ and ARC, and the remaining coalescence 
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events would also be distributed identically. Likewise, when there is population substructure with migration 
between subpopulations, the distribution of events occurring at recombination sites should be the same 
under the SMC’ and ARC. Finally, when there are more than two haplotypes sampled, it seems that the 
joint distributions of genealogies on either side of a recombination event would be the same between the SMC’ 
and the ARC marginally. These ideas need to be properly explored in future studies, but they suggest that 
asymptotic bias due to using the SMC’ in place of the ARC will be minimal under a variety of demographic 
scenarios. 
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6. Appendix 

6.1. Derivation of joint density of pairwise coalescence times at two loci 

To calculate the joint density of coalescence times, it is necessary to calculate Rj{t), the probability that 
the SMC’ two-locus Markov process (Figure]^ is in state Rj at time t, and I{t), the probability that the 
SMC’ process is in state I at time t. To solve for Rj{t), one can use the forward Kolmogorov equation (for 
J > 1) 


— P^j-iW + U + ~ (2j + 1 + p)Rj{t)- 


(16) 


Through substitution, the solution to ([T^ can be shown to be 


Rj{t) = Ro{t) 




(17) 


To hnd Ro{t), we note that it is equal to /ri.Ta {t, t) (see Eq. ([^). In turn, /ti,T 2 (t, t) = fxi {t)P{T 2 = t\Ti = t), 
where /ti (t) = is the marginal distribution of coalescence times at the first (or second) locus and 
P{T 2 = t\Ti = t) = is the probability of no change in coalescence times given the coalescence time t 

at the first locus. Here A(t) = | (l — -I- 2t) is the exponential rate of encountering a change in coales- 


et al. 2014). Thus Ro{t) is given by 


cence time along the chromosome given that the local coalescence time is t (|Eriksson et al. 2009 Carmi 

Roit) = 


This completes the solution of Rj{f). Using Figure]^ 


i7+(t)=/(t)-b^i?,(f). 


(18) 


(19) 


1=1 


where I{t) is the probability that the process is in state I at time t. Using 0 and ([T^ we get 


Y,Rjit) = Ro{t)j2 

1=1 1=1 


[f(l-e- 2 *)]^ 


( 20 ) 




3f(l-e-^‘)-l 


Next, I{t) satisfies the forward Kolmogorov equation 


I'it) = Y.(j-l)R,it)-2Iit), 

1=2 


( 21 ) 


the solution to which is 


7(t) = e-^‘ / e2“V(j-l)i?,(u)du 
Jo ^.^2 

= £ Roiu) {2e^“ + [{p - 2)e^- - p] } 


du 


( 22 ) 


= 1 - eH-2‘(p-2)-Kp-e-"*p) 


_p ,_p^ 

-e ^2 2 (-p) 4 


r ( _rf e 

4 ’ 4/ V 4 ’ 4 


Here, r(a, 6) = a;“ ^dx is the incomplete gamma function. 
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Together (18), (19), ( [^ , and (22) give the joint distribution ([^ for the SMC’. For the ARC and SMC, 
the quantities analogous to Ro{t) and R^{t) for these models can be obtained by exponentiating the rate 
matrices implicit in Figures and For the SMC, the joint distribution can also be derived using the 
representation 0- 

The walk on the states Rq, R 2 , ■ ■ ■ constitutes a birth-death process with killing, where birth events 
correspond to additional recombination events taking the process from Ri to death events correspond 

to coalescence events that take the process from Ri to Ri—i, and killing events, which take the process to 
an absorbing state, here correspond to coalescence events that take the process to Cl, Cr, or I. Under this 
formulation, the birth rate is constant \i = p, the death rate is linear pi = i, and the killing rate is linear 
= i -|- 1. This class of processes was studied by VAN Doorn and Zeifman] (2005 ), who demonstrated a 
different approach for calculating Ri(t). This alternative approach (not shown) confirms our derivation of 

p). 


6.2. Calculating asymptotic bias 

We are interested in estimating a single relative population size ij (defined relative to the true population 
size, N), which must be incorporated into the transition density function q{t\s) at recombination sites under 
the SMC and SMC’. Under the SMC, this transition density function is 


qsMc{t\s','n) = 


i(l-e-*/'') t<s 

ie-(‘-U/'?(i _ e-^/’?) t>s. 


(23) 


This is equivalent to the conditional density (S 6 ) with the addition of a relative population size parameter. 
Under the SMC’, the transition function is 




gsMC'(f|s; rj) = 


l + 4f-e- 


2 


(l_e-2V,) 


1 + ^-e- 


t < s 


t < s, 


(24) 


which is equivalent to the conditional density (S3) with a relative population size parameter included. 


Under the SMC, given the local coalescence time t, the distance along the chromosome until the nearest 


recombination event (measured in units of p) is exponentially distributed with rate t (McVean and Cardin 


2005). The likelihood function for a single relative population size rj under the SMC is thus 


1 £ ^ A: 

LsMc{'ri\{{U,li)}) = -e~^ Wqsuc{ti\ti-i,'n)W_tie 

^ i=2 i= 

1 ‘ ^ 

oc -e~^ 0<7SMc(U|U-i;?7)- 


— tdi 


(25) 


Under the SMC’, the likelihood function for a relative population size 77 is 

k k 


1 £ ^ fc 


(26) 


where X{t,r]) = j[ri{l — e + 2£] is the exponential rate of encountering recombination events that 

change the coalescence time when the local coalescence time is t ( Eriksson et al. 2009). Note that under 
the SMC, the length li of a segment is independent of the relative population size rj given the local coales¬ 
cence time ti. This is not true for the SMC’, since the probability that the coalescence time changes at a 
recombination site depends on the population size. 

As the length of the chromosome increases and the number of coalescence-time changes goes to infinity, 
the asymptotic maximum-likelihood estimate fj* of the relative population size under the SMC is 
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1 T r 

fj* = lim argmax -e~^ gsMc(i*|ii-i; ??) 

fc-^-oo rj r] 


= lim argmax i log (-e ^ ^ log [gsMc(ii|ii-i; ??)] \ 

fc^oo ^ J 

k 

= lim argmax V'log [gsMc(i*|ii-i;»?)] 

= argmax Earg [ log(gsMc(7’|<S'; ??))] 

T^SMC'is)qsMC'{t\s; 1) log (gsMc(i|s; v)) dtds 


( 27 ) 


poo poo 


= argmax / 

r} Jo Jo 

« 0.95. 

Here the penultimate equality holds only if there is ergodic (i.e., law-of-large-numbers-like) convergence 
of the sequence of pairs of coalescence times on either side of a recombination site under the ARG. In the 
Supporting Information, we show that the continuous-chromosome pairwise ARG is ergodic. That is, the 
mean coalescence time across a long chromosome converges to the mean coalescence time at a single point 
along the chromosome. We are unable to prove the ergodicity of the sequence of pairs of coalescence times at 


recombination sites where the coalescence time changes; instead, we note that (27) is supported by simulation 


(see above). We also note that Wiuf (2006) proved the ergodicity of the discrete-locus ARG under a variety 
of neutral demographic models. A similarly in-depth proof may also apply for continuous-chromosome 
models, but we do not explore the point further. 


In (27), the ultimate equality follows from the fact that the joint distribution of coalescence times is 
marginally the same at recombination sites under the ARG and the SMG’. Numerical maximization of the 
double integral shows that the maximum-likelihood estimate of a single population size N under the pairwise 
SMG is asymptotically biased, with the asymptotic estimate being approximately 0.95A^. 

Under the ARG, the stationary distribution of lengths between recombination events that change the 
local coalescence time (i.e., the identity-by-descent segment length distribution) is slightly different from 
that of the SMG’. (They are different because subsequent recombination events “heal” with slightly different 
probabilities under the ARG, while under the SMC’, each subsequent recombination event “heals” with 
the same probability.) Under the ARG, the identity-by-descent (IBD) length distribution is not currently 
known. Given that under the SMC’ the maximum-likelihood estimator for a relative population size involves 
the observed lengths, it is not currently possible to calculate the asymptotic bias of the pairwise SMC’ 
maximum-likelihood estimator of a single population size. However, the IBD length distribution under the 


ARG is approximated very closely by the SMC’ IBD length distribution (Carmi et al. 2014), so the SMC’ 
estimator is likely to be nearly asymptotically unbiased. 
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1. Supporting Information 


1.1. Coalescence time distributions at recombination sites 

Here, we derive the joint, marginal (i.e., one-locus), and conditional distributions of coalescence times at 
recombination sites where the coalescence time changes under the ARG, SMC’, and SMC. The distributions 
related to the ARG and SMC’ are derived from analysis of the continuous-time Markov chains representing 
coalescence times at such recombination sites under these models (Figure]^. Under the ARG and SMC’, 
the joint density function of coalescence times at recombination sites that change the coalescence time (i.e., 
the joint density of S and T) is 


fs,T{s,t) = 


4 ( 1 - 


and the marginal density function of S (or T) is 


4 

— s 


(1 - e-2‘) e-'* 


s < t 
s > t, 


7 r(s) = qC ® (2s -I- 1 — e . 

8 

The conditional distribution of T given S is 


/T|s(i|s) = 


fs.rjsC) 

7r(s) 


2(l-e“’^‘) 


_e-2s_|_2s 

l_e- 2 =+ 2 s 


t < S 


t > S. 


(SI) 


(S2) 


(S3) 


Equations (SI), (|S2|), and (S3) hold marginally at recombination sites where the coalescence time changes 


under both the ARG and SMC’. Equations (S2| and (S3) were derived for the SMC’ by Carmi et al. (2014 


see eqns. (8) and (9), respectively), confirming our derivation. 

Under the SMC the process for generating coalescence times at recombination sites is equivalent to the 
continuous-time Markov chain in Figure with the transition rates from Ri to Cl and Cr equal to 1 
instead of 3/2. Under this model for the SMC, the joint density of coalescence times on either side of a 
recombination event is 


fs.risC) = 

and the marginal density of S (or T) is 


e *(1 — e s <t 
-"(l-e-‘) s>t 


7r(s) = se 

The conditional distribution of T given S under the SMC is 

' l-e~ 


fT\s{t\s) — 


fs,T{s,t) 

7r(s) 


1 e ®^(1 —e 


t < S 
t > S, 


which confirms the derivation of Li and Durbin] (2011 cf. their Eq. (S6)). 


(S4) 


(S5) 


(S6) 


1.2. Pairwise ARG is ergodic 

Here we show that the pairwise ARG is sequentially ergodic. Let {t(a;)}a;>o represent the random pairwise 
coalescence time at point x along two aligned, continuous, infinitely-long chromosomes modeled by the ARG. 
Let time be scaled such that the marginal distribution of t{x) is exponential with rate 1 for all a; > 0, and 
thus E[t(a;)] = 1. Let the distance across the chromosome be measured such that a segment of length I 
recombines apart back in time at rate 1/2. (Equivalently, a recombination event happens in the chromosome 
interval (a;, x + dx) in the time interval (t, t + dt) with infinitesimal probability dx dt.) 

One useful property of t{x) is that it is strongly stationary. That is, the joint distribution of {t{xy\a<x<b 
is the same as the joint distribution of {t{x)}a+h<x<b+h for all 0 < a < 6 and h > Q. To see that this is 
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the case, consider the WiUF and Hein (1999) algorithm for constructing an ARG sequentially across the 
chromosome: at a given point, a genealogy is drawn from the marginal distribution of genealogies, and then 
the algorithm proceeds along the chromosome generating recombination events and genealogies, where at 
each point along the chromosome, such events are drawn from the conditional distribution given all previous 
coalescence and recombination events. The initial point from which the marginal genealogy is drawn has no 
effect on the resulting joint distribution of genealogies. 

A stationary process t{x) is ergodic if the covariance function r{x) converges to zero as x goes to infinity 
(Karlin and Taylor[[I 9^). Under the ARG, the covariance function is 


r{x) 


a: + 18 

x^ + ISx + 18’ 


(S7) 


which satisfies this condition. Thus the pairwise ARG is sequentially ergodic: the mean coalescence time 
across a long chromosome converges to the mean coalescence time at a single point. A similar proof could 
be given for the discrete-locus ARG with evenly spaced loci, which has a covariance function of the same 
form as the continuous-chromosome ARG. 


1.3. Supplementary Figures 



Figure SI: Schematic of the ARG back-in-time Markov process for two loci. The process starts in state Rq, and transitions to 
other states occur with the rates indicated by arrows between states. 
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Figure S2: Back-in-time Markov process for generating a coalescence time T 2 at the right locus conditional on the time Ti = £1 
at the left locus under the SMC’. Starting at time zero in state Ro, the process follows the transitions indicated by the solid 
arrows at the rates accompanying these arrows. Transitions indicated by dotted arrows are followed instantaneously at time £ 1 . 
See |HOBQLTH and Jense^ | |201^ for analogous processes for the ARG and SMC models. 
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